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Abstract 

Improving both accuracy and computational performance of numerical tools is a major challenge for 
seismic imaging and generally requires specialized implementations to make full use of modern parallel 
architectures. We present a computational strategy for reverse-time migration (RTM) with accelerator- 
aided clusters. A new imaging condition computed from the pressure and velocity fields is introduced. 
The model solver is based on a high-order discontinuous Galerkin time-domain (DGTD) method for the 
pressure-velocity system with unstructured meshes and multi-rate local time-stepping. We adopted the 
MPI-I-X approach for distributed programming where X is a threaded programming model. In this work 
we chose OGGA, a unified framework that makes use of major multi-threading languages {e.g. CUBA and 
OpenCL) and offers the flexibility to run on several hardware architectures. DGTD schemes are suitable 
for efficient computations with accelerators thanks to localized element-to-element coupling and the 
dense algebraic operations required for each element. Moreover, compared to high-order finite-difference 
schemes, the thin halo inherent to DGTD method reduces the amount of data to be exchanged between 
MPI processes and storage requirements for RTM procedures. The amount of data to be recorded during 
simulation is reduced by storing only boundary values in memory rather than on disk and recreating 
the forward wavefields. Computational results are presented that indicate that these methods are strong 
scalable up to at least 32 GPUs for a three-dimensional RTM case. 


1 Introduction 


Reverse-time migration (RTM), introduced in the early 80s Baysal et al. 1983 Loewenthal 1983 McMechan 


1983 

Whitmore 

1983, Lailly, 1983 

Tarantola 


1984 , is a migration method based on the wave equation. 


Although not the only means of obtaining images of the subsurface from seismic data, it is currently widely 
used in the oil & gas industry Etgen and Michelena 2010 . RTM constructs a subsurface image by corre¬ 


[e.g. 


Claerbout , 1985 Bleistein et al. 2001 


lating source wavefields with time-reversed receiver wavefields 
However, the procedure is very demanding in terms of computational and memory resources. It can be im¬ 
plemented as a single-step procedure or as an iterative scheme, minimizing the difference between observed 
primary reflections and synthetic data modelled by the Born approximation of the wave equation. At each 
iteration, the solutions of two numerical simulations have to be computed and cross-correlated: a forward 
simulation to determine the source wavefields in a background model that is reflection-free in the seismic 
bandwidth and a reverse-time wavefield generated by the data or data residuals at the receivers. In typical 
industrial applications, both simulations are expensive and must be performed on clusters with many-core 
CPUs and, optionally, with accelerators such as GPUs to further boost the performance. It is therefore 
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critical to develop numerical methods that are both efficient and scalable on such clusters. Also, strategies 
have to be found to compute the RTM cross-correlation without penalizing the final run time. 

Nowadays, finite-difference methods are the most widely used numerical schemes. A large literature is 
available, notably to reach high-order convergence rates and to reduce dispersion and dissipation errors [see 

for a review]. Alternatively, several kinds of finite-element methods have been 


Virieux et al. 2011 


e.g 

proposed, such as spectral finite-element methods Komatitsch and Vilotte 1998 Komatitsch and Tromp 


1999 , continuous mass-lumped finite-element methods |Chin-Joe-Kong et al. 1999, Cohen et al., 2001 


or 


discontinuous Galerkin (DG) methods Gollis et al. 

2010 

Dumbser and Kaser 

2006 

Etienne et al. 

2010 

Krebs et al. 

2014 

Mercerat and Glinsky 

2015 . In contrast to the standard finite-element approach, those 


methods do not require the solution of large sparse linear systems of equations, but if a lumping procedure or 
a block diagonal mass matrix is chosen, it is possible to use an explicit time-stepping scheme. Such methods 
have some advantages over finite-differences. They can easily handle geometric or physical discontinuities 
by using unstructured meshes without suffering from the loss of accuracy that afflicts most finite-difference 
schemes. Moreover, a hybrid discretization can be deployed thanks to the flexibility in mixing several kinds of 


elements and different discretization orders. In comparative work Baldassari et al., 2012, Minisini et al. 2012 


Moczo et al., 2011 Zhebel et ah, 2014 , these finite-element methods generally exhibit similar performance 


results and they surpass classic finite-difference schemes in cases with geometric or physical discontinuities. 
However, in three dimensions, spectral elements are currently only available for hexahedral meshes, which are 
less flexible than tetrahedral meshes. Continuous mass-lumped finite elements are available for tetrahedrons 
only up to third-degree polynomial basis functions, enabling at most a fourth-order spatial convergence of the 
scheme. The DG approach does not suffer from these limitations, but it is generally considered as expensive 
because of the excessively large number of degrees of freedom. In contrast with other methods, DG schemes 
are easily combined with local time-stepping strategies in order to improve computational efficiency and 


speed up of geophysical wave propagation Baldassari et al.l 2011 Dumbser and Kaser, 2009, Minisini et al. 


2013 


Accelerator-aided clusters require specialized algorithms and computational implementations to make 
full use of the hardware. In the computational geophysics community, several implementations have been 
proposed for the numerical modeling of seismic wave propagation problems with clusters of GPUs. They 


are based on finite-difference schemes [Michea and Komatitsch 

2010 

Weiss and Shragge 

2013 , a spectral 

element scheme Komatitsch et al. 

2010 or a discontinuous Galerkin method Mu et al. 2013 . To our 


knowledge, only a few papers have been published for complete RTM procedures on GPUs, and only with 
finite-difference schemes Abdelkhalek et al. 2012 Liu et al., 2013 Yang et al. 2014 . On the other hand. 


Klockner et al. 2009 have proposed GPU implementations for wave-like problems written with time-domain 


first-order wave systems and discretized with nodal DG schemes, but does not address RTM imaging. The 
local element-to-element coupling and the dense algebraic operations required per element make nodal DG 
methods suitable for parallel multi-threading computations, especially with GPUs. This implementation has 


successfully been adapted for several applications Fuhry et al. 2014, Gandham et al. 2015 Godel et al. 


2010, Modave et al. 2015 


A supplementary difficulty for RTM procedures is to compute the cross-correlation of the solutions of 
source and receivers simulations. Because those simulations are performed in opposite time directions, it is 
common to either store the source wavefield at check-points or to reproduce it backward in time, in order 
to compute the cross-correlation during the reverse-time simulation for the receiver wavefield. The source 
wavefield can be reproduced by using a check-pointing strategy or by saving boundary values during a first 
forward run Dussaud et al. 2008, Symes 2007 . However, in both cases, a substantial storage space is 


required as well as memory transfers during the runs, which can slow down the total run time. A strategy 
based on random boundaries can overcome this bottleneck, but the final image can be polluted by noise if the 
coverage of the survey is not dense enough Clapp 2009 Liu et al. 2012 . Several nonreconstructive imaging 


methods have been recently proposed by using alternative imaging conditions Nguyen and McMechan, 2015 


In this paper, we propose a novel implementation, named “RiDG”, for RTM procedures with clusters of 


GPUs. It is based on first-order wave systems, with a high-order nodal DG scheme Hesthaven and Warbur- 


ton 2002 2007 , unstructured meshes made of tetrahedras, and a multi-rate Adams-Bashforth time-stepping 
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Godel et al. 2010 . Using first-order systems instead of second-order allow to investigate alternative imaging 


conditions. We describe here an implementation for the pressure-velocity system and a novel imaging condi¬ 
tion based on the knowledge of both pressure and velocity, which improves final RTM images in comparison 
with the classical condition. 

RiDG is coded with the C-I--I- language, OCCA for multi-threaded programming, and MPI for parallel 
computing with distributed memory. OGGA [Medina et al. 2014 is a unified programming framework that 
abstracts major multi-threading languages (OpenCL, GUDA, pThreads and OpenMP), offering flexibility 
to choose the hardware architecture and thread-model at run-time. RiDG directly follows the strategies of 
Klockner et al. 2009 for GPU computing. We have particularly taken care of memory transfers and memory 


storage that are required for a complete RTM procedure. Despite those transfers, RiDG reaches an excellent 
strong scalability for a three-dimensional case with 32 GPUs. 

This paper is organized as follows. In section 2, the mathematical wave model and the novel imaging 
condition are introduced. Section 3 is dedicated to numerical schemes. The high-order nodal DG scheme, the 
multi-rate Adams-Bashforth time-stepping and the numerical approximation of the imaging condition are 
detailed. In section 4, the computational implementation with OCCA and MPI is described. In particular, 
the OCCA programming framework is introduced, and the management of computational work and memory 
transfer is discussed. Finally, computational results for three-dimensional benchmarks are provided in section 
5. 


2 Wave Model, Imaging Conditions and RTM Procedure 


The RTM implementation described in this paper is based on the acoustic wave model formulated in terms 
of pressure p(x, t) and particle velocity v(x,t), 

/(t)(5(x-Xo), (1) 

0 . ( 2 ) 

In this work, both the density p and phase velocity c are assumed to be positive and piecewise constant. In 
the source term, f{t) is understood as a signal and (5(x — Xq) is the Dirac delta function. In the complete 
mathematical model, a free-surface boundary is usually represented with the boundary condition p = 0, and 
a transparent boundary can be approximated with the boundary condition 

p—pcn-v = 0. (3) 


dp n 

-^+pc V-v = 
9v 


For each set of source and receiver signals, the RTM procedure requires the numerical solutions of two 
wave propagation problems using an initial model of the subsurface material parameters. The ‘source’ 
solution is obtained by emitting the source signal forward-in-time, and the ‘receivers’ solution is computed 
by re-injecting backward-in-time the signals recorded at receivers. The key step of the method then consists 


in computing an imaging condition that correlates both solutions. A classical imaging condition Bleistein 
et al.l|2001[|Claerboutl|1985| is 


/(x) = 


Ps(x,t;s) pR(x,t;s) dt, 


( 4 ) 


where ps(x,t;s) and pR(x,t;s) are the pressure helds of the source and receivers wavehelds for shot s, 
respectively, and t/ is the final time. In the following will neglect the shot index s. Note that equation Q 
is expressed in its simplest form, without “true-amplitude” migration weights, as we want to focus on the 
computational rather than on the geophysical issues. High values of /(x) correspond to geophysical features 
of interest, generally called reflectors. 

While most RTM implementations are based on the second-order wave equation that provides only the 
pressure field, both pressure p and velocity v are available with our hrst-order formulation. This allows 
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US to investigate different definitions for the imaging condition, using both types of wavefields. In his 
foundation paper, Claerbout 1971 introduced the imaging condition as the cross-correlation of “downgoing 
generated by the source and “upgoing waves” that reach the receivers. Formula Q corresponds 

This assumption is valid when using a one- 

The last 


waves 

to this definition if no reflections occur in the underground. 

way wave equation or if the subsurface model is very smooth or has a constant impedance, 
option will be considered here for the pressure-velocity system which is a two-way wave model that 

accurately simulates waves propagating in all directions. Therefore, the resulting imaging procedure does 
not suffer from dip limitations [McMechan 1983 , in contrast to one-way wave-equation models. However, 
a well-known disadvantage of the wave equation is the correlation of forward and time-reversed wavefields 
following the same path, generating long-wave artefacts in the image. Typical examples are diving waves 
and strong refractions. One strategy to overcome this difficulty consists in decomposing the source and 
receivers wavefields into upgoing and downgoing components and only keeping the interesting components 
in the imaging condition [e.g. Hu and McMechan 1987, Liu et al., 2011| . In our case, it is easy to perform 
such a decomposition thanks to the characteristic fields of the first-order wave system Recall that, 

in one dimension with z as the spatial coordinate, the system is equivalent to 


— {p + cpvz)+c—{p + cpv^) = 0, 

cpv^) - c-^{p-cpv^) = 0 , 


(5) 

( 6 ) 


where Vz is the vertical velocity. These equations correspond to one-way wave-equation models, where 
waves propagate upwards and downwards, respectively. By analogy with the previous works, we propose the 
imaging condition 


rtf 

li^) = X! / 9s 

shots ^ 

where q~^ and q~ correspond to the upgoing and downgoing characteristic fields, defined as 

9=^(x,f) = p{K,t)±cpez 


(7) 

( 8 ) 


with the vertical unit vector e^, pointing upwards. By using characteristic fields, the new imaging condition 
Q takes into account upgoing information of the source wavefield and downgoing information of the receiver 


wavefield. For steep dips, the above can be easily generalized along the lines of Liu et al. 2011 by also 


considering decompositions in e^, and e^. A complete study of imaging conditions based on characteristic 
variables is out of the scope of this paper, and will be proposed in future work. 

Approximating imaging conditions Q and Q with classical numerical integration requires the knowledge 
of both source and receivers solutions at each time-step. As mentioned in the introduction, this represents 
a difficulty because these solutions are simulated with opposite time directions. Following [Dussaud et al.| 
2008 , we perform both source and receivers simulations backward-in-time, together with a step-by-step 


computation of the imaging condition. The backward-in-time source simulation is made possible thanks to 
a preliminary forward-in-time run, where both the boundary values of fields and the final solution are saved 
to reproduce the solution in the other (reverse) time direction. 

With a nodal DG scheme, we need to save the pressure p and the normal component of the velocity n • u 
(called traces of fields in the remainder) only at the boundary nodes, even with high-degree basis functions. 
By contrast, when using this strategy with a high-order finite difference scheme, a thick enough halo of data 
must be saved to compute the thick stencil of the scheme at the boundary [see e.g. Yang et al. 2014 


In three dimensions, the amount of data at boundaries increases linearly according to the stencil radius for 
finite difference schemes whereas the halo zone for a nodal DG scheme is just one node deep. The thin halo 
inherent to the nodal DG method then allows for a storage requirement lower than with the widely used 
high-order finite different schemes. 
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3 Numerical Schemes 


In this section, we detail the numerical schemes used to solve for the source and receiver wavefields and to 
evaluate the imaging condition /(x) from their correlation. The spatial discretization is based on a nodal 
discontinuous Galerkin scheme. The time-stepping is performed through a multi-rate scheme with local time 
stepping where needed. These are detailed in sections |3.1| and |3.2[ respectively. The numerical evaluation of 


/(x) is addressed in section 3.3 


3.1 Nodal Discontinuous Galerkin Scheme 

For each problem, the approximate fields are build on a spatial mesh of the computational domain SI C K^, 
made of non-overlapping tetrahedral cells. 


= [J Dfc, 

k=l,...,K 


where K is the number of cells and is the cell. With the nodal discontinuous Galerkin method, the 
pressure field and each Cartesian component of the velocity field are approached by piecewise polynomial 
functions. The discrete unknowns correspond to the values of fields at nodes of elements Hesthaven and 


Warburton 2002 2007 . Over each cell Dfc, the approximate fields are then equal to 


Np 


Pk{^,t) 

— ^ ^ Pk,n (^) •; ^ D/;;, 

— 1 

(9) 

Vi,ki^,t) 

Np 

^ ^ 1 ^ ^k: ^ 

n—1 

(10) 


where Np is the number of nodes per element, Pk,n{t) and Vi^k,n{t) are the values of fields at node n of 
element fc, and ik,n{^) is the associated multivariate Lagrange polynomial function. Denoting the position 
of the node with Xfc_„, we have 


^k,n {^k,m) 


J 1, if TO = n, 

[ 0, if TO ^ n, 


for all TO, n = 1,... ,Np. In this work, the position of nodes is defined using the warp-and-blend technique 
. The number of nodes for each element is given by Np = {N + 1){N + 2){N + 3)/6, where 
N is the maximal order of polynomial functions. 

The spatial discretization is obtained from the variational formulation of the problem. For each cell D^, 
we are looking for p(x, t) G L 2 {Dk,M.'^) and v(x, t) G L 2 (Dfc,IR+) such that 


Warburton 2006 


/ V'(x)dx-H /" i (n • |v] - Tp IpD V'(x) dx = 0 (II) 

JDk \PC Ot / JdDk ^ 

Id Iod 

for all test functions V'(x) G ^2(0^) and '0(x) G L2(D^). In these equations, n is the outward unit normal 
to the cell boundary dDk, and |p] and |v] are the jump of helds at this boundary, i.e. 

bl = 

H = V+-V-, 


where the plus and minus subscripts denote nodal values on the side of the neighboring and current cells, 
respectively. The formulation is stable if the penalty parameters Tp and Tv are non-negative [see 


Warburton 
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2013 


for a general proof]. With a homogeneous medium, the values Tp = 1/ (pc) and Tv = pc correspond to 
classical upwind fluxes. 

The semi-discrete equations are obtained by injecting the approximate representations of fields 
in equations (11|-(12), and using the Lagrange polynomial functions as test functions. For each element k, 
a system of equations can then be written as 


, p 3 3 Nf 

i=lJ=1 f=l 


dt 

dsl 

dt 


1 


Nf 


/=1 


pk r=i 


(13) 

(14) 


where the vectors and contain the discrete unknowns associated to each field for all nodes of element 
k, and the vectors ^ and contain the penalty terms for all face nodes of face /. The first terms of 
the right-hand sides of these equations correspond to the volume integrals with spatial differential operators 
of the variational form, while the second terms correspond to the surface integrals. The matrices Dj and 
are respectively the differentiation matrices and the lifting matrices of the reference element. The geometric 
factors g^l j and depend on the shape of each element. The definitions of the matrices and the factors, 
and a complete derivation of the semi-discrete equations are given in appendix [X| 

A point source term is incorporated into the formulation by adding 


J{t) 5(x-xo)'0(x)dx, 


(15) 


to the right-hand-side of equation (11). A corresponding term will enter equation (13). In the term (15), 
the Dirac delta function is replaced by its L2-projection on the polynomial basis, (5(x — Xq) = Wn(-k,n{'^), 

where the coefficients are obtained by solving the system 

dx = 0, for TO = 1,..., Np. 

For sake of clarity, we finally rewrite the semi-discrete equations as the abstract system 


r 

Np 


6{X - Xo) - ^ W„4,n(x) 

lOk 

n—1 


dc\k 

dt 


= rfe, 


(16) 


where is the vector of all discrete unknowns for element fc, and combines the rights-hand side terms of 
equations (13)-(14). The global semi-discrete scheme is simply built by combining the local systems of all 
elements. 


3.2 Multirate Time-Stepping Scheme 

The resolution in time is performed by using an explicit multi-rate scheme based on the third-order Adams- 
Bashforth method. Such schemes are advantageous in comparison to single-rate ones for problems with 
refined meshes or involving multiscale dynamics. The stability of explicit single-rate schemes is guaranteed 
by the global CFL condition 


At < C min 

k 



where C is a constant that depends on the time integration scheme and the spatial approximation, hk is the 
characteristic size of element fc, and Ck is the corresponding phase velocity. The maximum stable time-step 
At is thus determined by the element having the smallest ratio hk/ck- With multi-rate schemes, the update 
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of approximate fields on each element is performed with a time-step Atk that is local to the element, and 
the stability condition becomes local, 

Atk < C—. (17) 

Larger local time-steps are then allowed for elements having a large ratio hk/ck, thus providing a reduction 
of the computational cost of the global scheme. 

The multi-rate Adams-Bashforth (MRAB) scheme is efficiently implemented by grouping the elements 
into levels with fixed time-steps 

1, the local time-step is given by 

Ati = 

where Aieveis is the number of levels, and Atiocai is the smallest considered time-step. The level of each 
element depends on the maximum local time-step allowed by condition 0 - After a first lumping of elements 
based on this condition, some of them are moved from coarse levels (with larger time-steps) to finer levels 
(with small time-steps) in order to have no more than one level of difference between two neighboring elements 
of the mesh. Therefore, the local time-steps of two neighboring elements are the same or differ with a factor 

2. Let us denote Atgiobai the global time-step, which corresponds to the local time-step of the coarsest level 
(ie. the level 1). For each level I, Ni = 2*“^ local iterations are needed to achieve a global iteration. 

At each local iteration, the discrete unknowns of elements are updated according to the classical third- 
order Adams-Bashforth formula, 


Godel et al.[ 2010 Gandham et al. 2015 e.g. ]. For each element k of level 


n-\-m/Ni 


n 




-\-{7n-l)/Ni 


3 

-I- Ati ^ as 


n+{m-s)/Ni 

'■k 




(18) 


with oi = 23/12, 02 = —16/12 and 03 = 5/12, and where the indices n and m correspond to the global and 
local time-steppings, respectively. This formula requires the knowledge of the right-hand side vector at the 
three previous local steps. For each step, this vector is thus computed with the values of unknowns local to 
the element, as well as those of the neighboring element at each interface, according to equations (13)-(14|. 
However, at the interface between two elements belonging to two different levels, interface unknowns of the 
coarse-level element are available at only every two local steps of the fine-level element. For the intermediate 
step, the unknowns of the coarse-level element evaluated according to the modified Adams-Bashforth formula 


n+{m-'i-/ 2 )/Ni 

qfc 




f^k 


i+(m-s)/Ni 


(19) 


with the coefficients hi = 17/24, 62 = and 63 = 2/24. 

This procedure is straightforwardly adapted for backward-in-time simulations. It suffices to consider that 
time decreases when the indexes n and m increase. No changes are then required in the update schemes, but 
the penalty parameters Tp and now must be strictly negative to stabilize this backward-in-time scheme. 


3.3 Evaluation of Imaging Condition 

We will now focus on the computation of the image /(x), defined by equation Q. The generalization to 
the imaging condition 0 is straightforward. For one sample of source-receivers signals, the time integral 
of I (x) is evaluated time-step by time-step along a backward-in-time computation of both source solution 
Ps{x,t) and receivers solution pu(x,t), as explained in section]^ Each local time-step contribution to the 
time integral, 

/ Ps(x,t) PR(x,t) dt, (20) 
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with = (n + 'm/Ni)At, is computed at each node x and the values are accumulated over the steps 

to provide I (x). 

Integral (20 1 could typically be evaluated using the trapezoidal rule 


2 


'^n+mM(x) 


/ X n- 

W Pk 


n+(m-l)/Ni 



( 21 ) 


which is a first-order approximation in time to the exact integral. We propose a better evaluation considering 
that the fields are obtained with a Adams-Bashforth scheme. Let us recall that this scheme uses on polynomial 
representations of fields 


Ps(x,t) 

PR(x,t) 


3 / 

..I 

..I VO 



.n-|-(m-s)/Wi 

( 22 ) 

.n-|-(m-s)/Wi 

(23) 


with = (t — /Ati, where ^ 2(0 and £ 3 )^) are Lagrange interpolation functions with 

interpolation nodes at 0 , —1 and — 2 , and g^j-g ^j^g components of right-hand 

side vectors corresponding to pressure field. Injecting representations (22|-(23) in integral (20 1 gives 


-f 


-f 


-f 

with the coefficient matrix 


Atf (x) (x) 

S^l 

Si—1 S 2 — I 



(24) 


This formula gives the exact value of integral (20) for fields that are third-order approximation in time. 
Therefore, formula (24) theoretically keeps accuracy of numerical fields with an 0{At^) numerical error, 
while approximation (21) gives an 0{At) error. 

It is worthwhile to mention that this numerical evaluation is based on values of fields and right-hand 
side vectors that are available to update the fields with formula (18). The contribution (24) to the imaging 
condition is thus computed at each step right before the update of fields, without significant supplementary 
cost in comparison with approximation ( 21 ). 


4 Parallel Computation and Implementation 


In this section, we describe the RiDG implementation for massively parallel computations on accelerator- 
aided clusters. The fundamentals and features of OCCA are described in section [4T| Our implementation for 
a single multi-treading device is detailed in section |4.2| and the adaptations for distributed-memory clusters 
are explained in section |4.3| The memory transfers required for MPI communications and for the RTM 


procedure are discussed in section 4.4 
















4.1 Multi-Threading Programming with OCCA 


OCCA is a unified framework for multi-threading programming that can be used for several shared-memory 
hardware architectures, such as CPUs, GPUs and Intel’s Xeon Phi. The current version translates a sin¬ 
gle implementation of computational kernels to the OpenMP, OpenCL, pThreads, Intel COI and CUBA 
languages. It has been developed to maintain portability and performance together with platform-choice 
flexibility. Using this programming approach therefore allows customized implementations of algorithms for 
several computing devices with a single code. 

The OCCA library provides a host API and a device API. The host API gives tools to generates a 
self-contained context and command queue from hardware devices, to allocate and manage device memory, 
and finally to compile and execute kernels. The device API allows us to write kernels that are compiled at 
run-time, and that send instructions to the device when executed. The kernel language of OCCA mirrors 
those used for GPU programming. Threads (work-items) are grouped into groups (work-groups), which 
compose the main grid. The work-groups are queued for execution onto the available multiprocessors, and 
work-items are executed in parallel. The optimum choices of the number of work-groups, and the number 
of work-items per work-group depend on the algorithm and the characteristics of the device in use. These 
number are defined by the user, before the execution of each kernel, thanks to the host API. 

Further information about OCCA can be found in a white paper 
developments are available on the website http://www.libocca.org/. 


Medina et al. 2014 . The latest 


4.2 Implementation with One Multi-Threading Device 

The complete computational procedure for the forward time-stepping is given in algorithmic It is straight¬ 
forwardly adapted for the backward simulation. For each element fc, q^, stores all the local values of helds 
(p and each Cartesian component of v), and qf j,. stores a copy of traces corresponding to face nodes (p and 
n • V, where n is the outward unit normal to the face). When updating the right-hand-side vector, q^, is used 
to compute volume terms, while qfis used for the surface terms. The array rhs/j stores the right-hand side 
vectors corresponding to the three previous local time-steps, niocai and Mocai are the local step index and 
the number of local steps of the finest level. 


Algorithm 1 : Forward time-stepping procedure 


initialization 

initialize q^, for each element k 
initialize qf^ for each element k 
initialize rhs^ for each element k 

for TT-giobal ~ Ij • j Ag[Qi3a[ do 
for niocal = 1,2,..., Aiocal do 

Gtart ~ [(^global l)A^local F (^local 1)] 


At 


local 


^new — ^start H” ^^local 

for each element k with unknowns evaluated at fstart do 
L update rhsfc for t = fgtart 
for I — A^i0Y0[s,..., 2,1 do 

if the unknowns of level I must be evaluated at t^ew then 
for each element k of level I do 
^ update q^ and qf^ for t = tnew with update scheme (181 


for each element k that has not been updated at this local time-step, 
but that is coarse-level neighbour of evaluated elements do 
|_ update qff, for t = tnew with update scheme (19) 
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This global procedure can be naturally decomposed into several computational kernels. These computa¬ 
tional kernels are implemented in separate OCCA kernels, which allows us to optimize each task considering 
the characteristics of both the task and the device. Thereby, the loop iterations of algorithms [T] are performed 
by the main process (on the ‘host’), which calls the different kernels to execute operations on the device. 
Our implementation has three main kernels, which are called at each local time-step in this order: 


1. The volume kernel computes the volume terms and stores the result into array rhs^,. 

2. The surface kernel computes the surface terms and updates rhs*, with the result. 

3. The update kernel performs the local time-stepping with Adams-Bashforth schemes (18) and (19). 


A supplementary kernel (the source kernel) generates signals at sources and receivers by adding a specific 
term in the right-hand side vector. For the RTM procedure, a specific kernel (the image kernel) updates an 
image array Ik with the contribution of the current local time-step to the imaging condition. 

In implementing these kernels, we seek to optimize the use of the computational units of the hardware 
device and minimize waiting time when transferring data, in order to optimize run time. These questions have 
been studied by Klockner et al. 2009 2012 2013 for the implementation of nodal discontinuous Galerkin 


schemes on GPU. The ideas have then been applied with an Adams-Bashforth multi-rate time-stepping by 
Godel et al. 2010 and Gandham et al. 2015 in different wave contexts. 


Memory on the device 

The locality of memory storage is addressed by storing all the data required by the numerical schemes on the 
device {e.g. in the global memory of a GPU). A floating-point array q stores all the discrete unknowns, and 
the array qf contains a copy of traces associated to face nodes. The array rhs is reserved for the right-hand 
side vectors associated to the three previous time-steps. In a RTM procedure, all these arrays are duplicated 
for source and receivers simulations, and the imaging condition is stored in the specific array I. When using 
an accelerator, such as a GPU, these arrays are only stored in its global memory and are never transferred 
to the RAM of the compute node, at any moment, except to export the final RTM image. 

During an initialization procedure, the differentiation matrices Dj and the lifting matrices Ly are pre¬ 
computed and stored on the device in the aggregated arrays Drst and Lift. The Jacobian matrices d'^k/dr 
of all elements are stored in vGeoFac, and both Gartesian the components of normal vector and the Jacobian 
ratios Jkj/Jf are stored in sGeoFac. The physical parameters pk and Ck are managed using a database 
principle: the array phyDB contains the parameters values for each kind of medium, and the mapping between 
medium indexes and mesh cells is stored in the array phyMap. Finally, several arrays list elements associated 
to each multi-rate time-stepping level, and to the coarse neighbours for the MRAB strategy. 

In order to take full advantage of the cache memory, the different nodal values corresponding to a same 
field and a same element are stored contiguously in the array q. When fetching values from the memory to 
device registers, the memory bus operates by blocks of data instead of individual items. Since these nodal 


Symbol 

Dimensions 

Definition 

q 

K * iVfields ‘ -^pad 

Values of fields at nodes 

qf 

K ■ A^traces ’ Nf ■ Nfp 

Values of traces at face nodes 

rhs 

3 • AT • A^flelds • A^pad 

Right-hand side terms 

I 

K ■ A^pad 

Imaging condition 

Drst 

Np ■ .^dim 

Differentiation matrices 

Lift 

Nf ■ Nfp ■ Np 

Lifting matrices 

vGeoFac 

K ■ 

Geometric factors for volume terms 

sGeoFac 

K -A-Nf 

Geometric factors for surface terms 


Table 1: List of main arrays stored in the global memory of the device. Dimensions are given from the 
coarsest to the finest granularity of storage. 


10 






















Algorithm 2: Volume kernel 
input 

pointers *q, *rhs 

pointers *Drst, *vGeoFac, *phyMap, *phyDB 
for each block b of elements do 

shared arrays P, UdotGr, UdotGs, UdotGt (arrays AVikv • ^p) 
shared arrays vGeoFacWG (array ATbikv ’ 9) 
shared arrays phyParWG (array ATbikv ’ 2) 
for each element k of block b do 
for each node n of element k do 

load the geometric factors in vGeoFacWG for element k 
load the physical parameters in phyParWG for element k 

for each element k of block b do 
for each node n of element k do 
read the values of fields from q 
compute the linear combinations for the p equation 
store P, UdotGr, UdotGs and UdotGt 

for each element k of block b do 
for each node n of element k do 

read values of the differentiation matrices from Drst 
compute Dr • UdotGr + Ds • UdotGs + Dt • UdotGt for the p equation 
compute Dr • P, Ds • P and Dt • P, and the linear combinations for the Vi’s equations 
1 _ compute the volume terms and store in rhs 


values must be used together to compute the first matrix-vector products of the right-hand sides of equations 
(13)-(14|, a contiguous storage is advantageous. We improve the utilization of the memory bus by padding 
the blocks of nodal values in q. Vpad array elements are thus reserved instead of Np for each block, where 
Apad > Np is chosen considering the characteristics of the bus. Then, the blocks of Apad array elements 
corresponding to the different fields for a same element are stored contiguously. The coarsest granularity 
of storage thus corresponds to the element indexes. A similar granularity is used for qf, rhs and I. The 
dimensions and granularity of the main arrays are given in table 


Kernels 

The volume and surface kernels, shown in algorithms andare conceived and optimized in a similar way. 
When one of them is executed, the volume or surface terms are computed and stored in rhs for all nodes 
that must be updated at the current local time-step. For both kernels, each thread (work-item) performs 
the operations for the Ageids items of array rhs associated to a given node and the current time-step. Each 
work-group processes the nodes of several elements, Abikv for the volume kernel and Abiks for the surface 
kernel. These parameters provide a way to tune the occupation of the device for each kernel, and then to 
adjust the load balancing. The volume kernel is then compiled for ceil(Ar/Arbikv) work-groups of Abikv ■ Np 
work-items, and the surface kernel is compiled for ceil(A/A'biks) work-groups of ATbiks ■ ma,x{Np, NjNjp) 
work-items. 

The computation of volume and surface terms are efficiently processed in three sub-tasks, which corre¬ 
spond to the three OCCA inner loops of algorithms and First, the geometric factors and the physical 
parameters, which are shared by all the nodes and face nodes of a given element, are transported from the 
global device memory to shared arrays. The second sub-task consists in building the vectors that are used in 


the matrix-vector products of equations (13)-(14|. For the volume kernel, each work-item fetches the fields 


values of its assigned node, computes the linear combinations of velocity unknowns, and stores the results 
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Algorithm 3: Surface kernel 

input 

pointers *qf, *rhs 

pointers *Lift, *sGeoFac, *phyMap, *phyDB, *typeMap 
for each block b of elements do 

shared arrays pPena, uPena, vPena, wPena (arrays A^bikS ’ Nf ■ Nfp) 
shared arrays sGeoFacWG (array Abiks • 4 • Nf) 
shared arrays phyParWG (array ItTbiks • 2) 
for each element k of block b do 
for each node n of element k do 

load the geometric factors in sGeoFacWG for element k 
load the physical parameters in phyParWG for element k 

for each element k of block b do 

for each face node nf of element k do 

read the interior values of fields from qf 
read the type of face from typeMap 
if face at domain boundary then 
L define the exterior values of fields with the condition 
else if face at interface then 
L read the exterior values of fields from qf 
compute the components of penalty vectors 
multiply with the ratios of Jacobians 
store in pPena, uPena, vPena and wPena 

for each element k of block b do 
for each node n of element k do 

read the values of lifting matrices from Lift 

compute Lift • pPena, Lift • uPena, Lift • vPena and Lift • wPena 
|_ add the resulting surface terms to rhs 


in shared arrays. In addition, the right-hand-side vectors of the two previous time-steps are updated. For 
the surface kernel, each work-item fetches the field values of its assigned face node, both for the current and 
neighbouring element. Then, it computes the penalty terms, multiplies them with the ratios of Jacobians, and 
stores the results in shared arrays. Finally, the matrix-vector products are computed in the third sub-tasks 
of kernels, and the results are added to rhs. Each work-item effectively performs only one scalar product 
of vectors for each matrix-vector product. This operation is optimized thanks to the coherent granularity 
of storage and some supplementary fine-tuning (e.g. loop unrolls and storage of intermediates parameters 


in constant memories). Further details about these strategies can be found in the works of Klockner et al. 


2009 2012 2013 


The update kernel performs the time-stepping (18) for all nodal values at a given time level I, stores the 
new values in arrays q and updates the traces in qf , as shown in algorithm Each work-item updates the 
nodal values associated to a given node, and each work-group deals with AbikS elements. The operations 
are again achieved in sub-tasks. First, the values are updated and stored in a shared array. After a 
synchronization, they are transported into arrays q and qf , requiring respectively Np and NfNfp work-items 
for the most inner loop. The size of the work-group is thus ATbikS ■ max(7Vp, NfNfp). This kernel is also used 
to update coarse neighbour elements with the time-stepping (19). The image kernel is completely analogous. 
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Algorithm 4: Update kernel 

input 

pointers *q, *qf, *rhs and *Fmask 
for each block b of elements do 

shared array qNew (array Afidds ■ Kuks ' Np) 
for each element k of block b do 
for each node n of element k do 
|_ compute the updated fields and store the values in qNew 

for each element k of block b do 
for each node n of element k do 
|_ update q with the new values stored in qnew 

for each element k of block b do 

for each face node nf of element k do 

read node index n corresponding to face node nf in Fmask 
update qf with the new values stored in qnew 


4.3 Towards an Implementation for Cluster of Devices 


The main challenge for large-scale computations with a large number of cores (with or without accelerators) 
is to reach a good parallel scalability for the final implementation. This requires to take care of two critical 
aspects: load balancing and managing the latency inherent to data transfers between compute nodes. 

In our implementation, the device associated to each compute node of the cluster performs the whole 
computation for a sub-domain corresponding to one part of the mesh. The procedure presented in the 
previous section is naturally applied to each device. At each local time-step, MPI communications transfer 
the traces of fields situated at the interface between two sub-domains when necessary. For the backward run 
of the RTM procedure, both source and receivers simulations are performed using the same mesh partition: 
this ensures the computation of the imaging condition without extra data transfers between processes. 

The mesh partition is generated using METIS with a weighting strategy. The weight Ni = is 


associated to each cell, where I is the time level of the cell. As explained in section |3.2[ Ni is also the 
number of local iterations required to achieve a global iteration for elements of level 1. This number then is 
proportional to the compute load required for the corresponding element, which allows to balance the global 
load between the compute nodes. 

Our experience has shown that the elements belonging to the finest time level were generally isolated in 


the mesh (see benchmark of section 5.2). In the multi-rate time-stepping strategy, these elements are updated 
at every smallest time-step, and only those are updated for half smallest time-steps. By preventing METIS 
to place them at interfaces between mesh parts, we remove the need for MPI communications when only 
those are updated, which represents half of all potential MPI communications. This is done by lumping each 
finest element (or group of finest elements) with his neighbors before METIS partition, and by separating 
them after the procedure. 


4.4 Memory Transfers for MPI Communications and RTM Procedure 

We now detail our strategies for the memory transfers that are required to connect sub-domains together, 
and to record/replay fields at the domain boundary for the source simulation in the RTM procedure. Both 
operations require different kinds of memory transfers: 

1. The connection of sub-domains is obviously made through MPI communications (host-host transfers). 
Data to send must be pre-fetched from device memory, while received data are loaded on the device 
memory right after. Host-device transfers are then required. 
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Figure 1: Flow chart of the complete RTM procedure with the forward phase (left) and the backward phase 
(right). Each iteration of loops corresponds to a local time-step update (forward or backward). For the 
backward phase, each kernel performs the same operations for both source and receivers simulations. Boxes 
with dashed borders correspond to operations that are non-blocking for the host. Successive boxes with gray 
background correspond to independent operations that can be performed at the same time. 

2. Boundary data recorded for the source simulation must be stored in a memory large enough. The 
device memory usually being too small, the data is accumulated and stored either in the host memory 
— more precisely, in its random-access memory (RAM) — or on a hard disk disk drive (HDD). We 
considered both options. This then requires host-device transfers with optionally RAM-HDD transfers. 

The flow chart of the complete RTM procedure with the memory transfers is shown in figure [l] Our goal 
was to hide latency inherent to transfers regardless of the kind of transfer. Therefore, we used non-blocking 
transfers whenever it was possible, in order to perform the different transfers alongside computations on the 
device. While our implementation runs on clusters with only CPUs, it is mainly conceived for clusters with 
accelerators. In that context, calling a kernel is a non-blocking operation for the compute (host) node. The 
operations mentioned in figurej^are performed by the compute node, and data transfers from/to are written 
from that point of view. We describe hereafter several of the strategies that were employed. 

The majority of data transfers are performed at the beginning of iterations, alongside kernels that do 
not need to immediately process transfered data: the volume kernel, the surface kernel for inner elements, 
and the source kernel. On the flow chart (figure]^, these operations correspond to the two main groups 
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of successive gray boxes. By calling twice the surface kernel, once for inner elements (that do not touch 
the border of the sub-domain) and once for outer elements, we increase the compute time available to hide 
data transfers. Before starting MPI communications, data to be sent to neighboring sub-domains is fetched 
from the device memory to the host memory. Because this data must already be on the host memory to 
start communications, blocking host-device transfers are used. By contrast, during the backward simulation, 
boundary data can be loaded onto the device memory with non-blocking host-device transfers. RAM-HDD 
exchanges are made in a blocking way after all the non-blocking operations, but occur alongside them. When 
all these operations are completed, data received with MPI communications is loaded on the device memory, 
and surface terms of outer elements are computed. This ends the computation of the right-hand-side terms. 

Different containers are used to store and to transfer boundary data during the RTM procedure. A 
specific buffer array is allocated on the device memory for temporary storage. When the surface kernel is 
executed for outer elements, the traces corresponding to boundary nodes are copied into that buffer during 
the forward phase, and are read from it during the backward phase. This explain why, during the forward 
phase, boundary data are transferred from the device memory to the host memory at the end of each 
iteration. In the host memory {i.e. the RAM of the compute node), accumulated boundary data is stored 
in a sequence container with double ended queue. At each forward iteration, a new container is added at 
one end of the sequence with the current boundary data. When using a HDD to unload the RAM, data 
is transferred to the HDD by reading the sequence container from its other end, with a time delay of one 
global time-step. This means that the sequence container always contains boundary data corresponding to 
one global time-step. Since both host-device and RAM-HDD transfers involve different pieces of data, they 
can be performed simultaneously. The strategy is identical for the backward phase, except that HDD-RAM 
transfers are performed one global time-step in advance. 


5 Computational Results 

5.1 Validation Case 


In order to validate our implementation, numerical convergence as well as run times, we use a reference 
benchmark proposed by Zhebel et al. 2014 and |Minisini et a~ [2012| . 

The computational domain of the benchmark is a cube D = [0.2 km] x [0.2 km] x [0.2 km] made of two 
media separated with a dipping plane interface (figure]^). The density is I g/cm^ in both media, and the 
velocity is 1.5 km/s in the upper one and 3 km/s in the lower one. The plane interface runs from 0.7 km in 
depth at a; = 0, to 1.3 km at x = 2 km. A Ricker pulse with a peak frequency of 12 Hz is generated at a 
point source located above the interface at position Xg = (779.7 m, 1000 m, 516.3 m), and creates a spherical 
wave (hgure[^). The peak of the Ricker pulse is generated at instant tp = 0.1127 s and the final time of the 
simulation is t/ = 0.8127 s. During the simulation, the pressure is recorded at a receiver situated at position 
Xr = (1023.9 m, 1000 m, 746.2 m). The recorded signal is compared to the exact solution over the period 
[tp,tf] using the relative error defined as Zhebel et al. 2014 


jPnum Prei\ 


\ It' brefi 


' dt 


(25) 


where Pnum{t) and p^eiit) are respectively the recorded and reference signals. Because of the chosen final 
time, the boundary conditions have no influence on the error. 

The convergence of the error for polynomial basis functions with different degrees is shown in hgure 
For each degree N, relative errors are obtained with several meshes (with 294 508, 567 071, 1 005 575 and 
2 320 289 tetrahedra). The time-stepping scheme is used with only one time level and the time-step 

At = 0.15 min | 1, 

fc \(V-hl)2cfcJ 
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(a) Velocity model 


(b) Snapshot of pressure 




Figure 2: Representation of the half part of the computational domain with the coarsest mesh (a) and 
snapshot of the pressure at one instant (b). The velocity is 1.5km/s in the upper medium (light gray) and 
3km/s in the lower one (dark gray). The source position (upper black point) and the receiver position (lower 
black point) are represented on both figures. The box indicates the total size of the computational domain. 



Figure 3: Numerical convergence of the numerical scheme for different polynomial degrees N with the 
validation case. For each degree, the relative error is plotted as a function of the cube root of the total 
number of node NpK. Dashed lines correspond to convergence. 


where £k is the smallest perpendicular distance between a face and its opposite vertex for tetrahedron k. 
As expected, the error curve is close to the classical convergence of upwind fluxes for all polynomial 
degrees. 

Figure]^ shows performance curves with a single Nvidia K20m GPU and the CUBA programming frame¬ 
work through OCCA. It confirms the interest of hp-refinement strategy. Indeed, an optimum polynomial 
degree exists depending on the desired accuracy. For instance, to reach relative error 10“^, the third degree 
is more efficient than the second degree with a finer mesh. By contrast, for the relative error 10“^, the fourth 
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Figure 4: Computational performance of the implementation for different polynomial degrees N and 4 
meshes with the validation case. The run times are obtained with one single Nvidia K20m GPU and CUBA 
through OCCA. For the highest degrees, simulations have not been performed with the finest meshes due to 
memory limitation of the GPU. 



Figure 5: Comparison of run times obtained with different hardware devices. Third-degree polynomial basis 
functions are used with a single-rate time-stepping scheme. 


degree is worthwhile compared to the fifth with a coarser mesh. Therefore, the best strategy to improve 
accuracy of solution is to refine the mesh and to increase the polynomial degree together. 

Finally, figure shows performance curves obtained with several GPUs: Nvidia Tesla K20m, Nvidia 
Tesla K40m, Nvidia GeForce GTX 980 and AMD Radeo HD 7970. All the Nvidia GPUs have been tested 
with CUBA and Nvidia OpenCL. The results are in accordance with the specihcations of these GPUs. The 
floating-point performances of both Tesla K40c and Radeon HD 7970 are similar. Tesla K20m and GTX 
980 have respectively the smallest and largest floating-point performances according to the specihcations of 
the constructor. For both Nvidia Tesla GPUs, the implementation is faster with CUBA than with OpenCL 
(speedup: ~ 1.65), while identical performances have been obtained with the GTX 980. Only the Tesla 
K40c has a enough large memory to deal the Hnest mesh, for which 6.16GB must be stored on the GPU. 
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5.2 RTM Case for a Cluster of GPUs 


This second benchmark deals with the complete RTM procedure for a larger geometry. The classical imaging 
condition @ and the new one based on characteristics 0 are compared numerically. Both scalability and 
memory transfers of our implementation are studied using up to 32 Nvidia K20 GPUs of TACC’s Stampede 
cluster. 

The geometry consists in a salt dome embedded in a multi-layered domain of size 8.9 x 4.44 x 5.1 km^. 
The top layer corresponds to sea water, while the other represent sediments and rocks. The velocity model 
is shown in figure [^a). The discretization consists of a tetrahedral mesh made of 2 787 335 elements with 
a third-degree polynomial basis, which corresponds to 222 986 800 discrete unknowns per simulation. The 
mesh was generated by [Kononov et ah 2012 and used by Minisini et al. 2013 


For this case, the multi-rate time-stepping scheme is used with 3 time levels. The coarse, intermediate 
and fine levels contain respectively 94.2%, 5.7% and less than 0.1% of elements. The multi-rate time-stepping 
is thus particularly interesting here. As shown in figure [^, most of the intermediate and fine elements are 
close to geometrical features (domain boundary and interfaces between layers). Since the smallest elements 
are isolated in the mesh, the lumping strategy in the partition procedure allows us to avoid MPl exchanges 
for the finest level without penalizing the load balancing between MPl processes. 


Synthetic Survey and Imaging Conditions 

A synthetic survey has been performed with 29 receiver gathers situated in the top layer of the domain. The 
position of the first source is = (8 km, 2.2 km, 8 m), and the corresponding receivers are placed at depth 
Zr = 10 m on the grid {xr, yr) S [5 km, 7.9 km] x [2.1 km, 2.3 km] with regular steps 25 m and 50 m. The 
same setting is used for the other gathers, but with positions moved 250 m, 500 m, ..., 6.75 km and 7 km in 
direction x negative. In all cases, a Ricker pulse is emitted at the source with the frequency 50 Hz, and the 
peak is generated at tp = 0.1127 s. The total simulation time is tf = 3.6127 s. The imaging conditions are 
computed over the period t/] with = 0.75s. The backward run was only performed for that period. The 
transparent condition of equation 0 is used at the domain boundary. The boundary values are saved and 
replayed for the source re-simulation, simultaneously with the reverse-time receiver wavefield computation. 

Figure [^hows the final images obtained with the classic imaging condition Q and the characteristics- 
based oneT^. The main geometrical features of the reflectors are imaged with both conditions, but low- 
frequency noise contaminates the image obtained with the classic condition. This noise is suppressed by the 
new characteristics-based condition, providing a cleaner image. A similar observation has been made by jLiuj 
et al. 2011 with their condition based on a spatial Fourier transform to decompose the wavefields into up- 
and downgoing components. In contrast to that work, we use vertical characteristics to separate up- and 
downgoing waves at a negligible supplementary cost compared to the classic condition. This decomposition 
is illustrated in figure We should emphasize that wavefields can be straightforwardly decomposed locally 
along any oblique direction by choosing characteristics with the corresponding direction vector instead of 
in formula dsl. 


Memory and Computational Performance 

For each shot, the RTM procedure has been performed using 32 GPUs of TACC’s Stampede cluster, and the 
gmsh software Geuzaine and Remade, 2009 was used for post-processing operations. 14.5GB of data must 


be allocated on GPUs, fairly well distributed between GPUs (from 353MB to 498MB by GPU). Boundary 
data are here saved only on RAM (no RAM-HDD transfers). They requires 240GB in total or, on average, 
7.5GB by compute node. However, the distribution between the compute nodes is not well balanced: one 
compute node does not store data (the corresponding sub-domain does not touch the domain boundary) and 
one node must store 14.34GB. This is fortunately not a problem because compute node are configured with 
32GB of host memory and, as shown later in this section, latency due to memory transfers between host 
(RAM) and device (GPU memory) is quite well hidden by computations. 

The run time for the complete procedure with one shot is 12.08 min, which includes a preparatory phase 
of 1.75 min (mesh partition, generation of lists, etc.), and respectively 4.21 min and 6.12 min for forward 
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(a) Velocity model 


(b) Time levels 





Figure 6: Velocity model (a) and time levels (b) for the RTM benchmark. Elements of the coarse, interme¬ 
diate and fine levels are respectively white, orange and blue on figure (b). The box indicates the total size 
of the computational domain. 
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(a) Image with classical condition 




Figure 7: Final images in the x — z plane for the RTM benchmark obtained with the classical imaging 
condition (a) and the new one based on characteristics (b). 


and backward phases. The preparatory phase can be done only once for the complete survey, since it is the 
same for all shots. Let us mention that the backward phase has been made only for the period 

Scalability and Memory Transfers 

Our implementation exhibits good strong scalability, despite high latency data transfers, as shown in figure 
1^ Run times are obtained using 4 to 32 GPUs, considering both single-rate and multi-rate time-stepping, 
and three alternative strategies to study memory transfers: boundary data are saved on a HDD, they are 
saved on the RAM associated to each compute node, and all slow communications are cut off (boundary 
data transfers and MPI communications). 

When using the HDD storage for boundary data, the scalability is lost for the forward phase with 24 
and 32 GPUs. Let us recall that this storage requires supplementary RAM-HDD transfers, which are slower 
than transfers between memory device and RAM. In this case, the RAM-HDD transfers, take too long to be 
hidden by the run time of kernels. Fortunately, for a large enough number of GPUs, boundary data can be 
stored only in RAM, and scalability is recovered. 
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(a) Pressure field p 





(b) Upgoing characteristic field 9 + = p + cp 


157 264 


7.63 8 9,^ 


(c) Downgoing characteristic field q = p — cp ■ w 



Figure 8 : Snapshots of pressure and upgoing/downgoing characteristics in the x — z plane for one source 
simulation at one instant. The same scale of color is used for all snapshots. Wavefronts with a vertical 
propagation direction are decomposed into upgoing and downgoing components, only present in the corre¬ 
sponding characteristic fields. Horizontal waves are present in both characteristic fields. For wavefronts with 
an oblique propagation direction, the decomposition depends on the angle with the vertical direction. 


21 








(a) Scalability for forward phase 



(b) Scalability for backward phase 



Figure 9: Strong scalability for the RTM benchmark with 4 to 32 Nvidia K20 GPUs. Run times are 
given separately for the forward phase (a) and the backward phase (b), both using the classical single-rate 
Adams-Bashforth (AB) scheme and using the multi-rate version with 3 time levels. In order to study the 
influence of memory transfers on performance, we consider cases where the boundary data are stored on 
a HDD, where they are only stored in the RAM associated to each compute node, and where both MPI 
node-to-node communications and boundary data transfers are disabled. 
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Using multi-rate time-stepping instead of the single-rate version provides a speed up between 3.4 and 
3.6, and preserves the scalability. MPI transfers and device-host boundary data transfers are not perfectly 
hidden by computations because of blocking operations, such as the loading of traces received from MPI 
communications (see figure]^. This however represent only few percent of the total run time, and does not 
impact the strong scalability. 

6 Conclusion 

We have presented a high-performance computational strategy for the RTM procedure with accelerator-aided 
clusters. It combines efficient numerical schemes, a flexible programming approach, and a low memory stor¬ 
age/transfer strategy for RTM. Our final implementation exhibits excellent strong scalability with massively 
parallel GPUs. In addition, we have introduced a novel imaging condition that reduces noise in the final 
image, without significant extra computational cost in comparison with classical strategies. We summarize 
hereafter key features contributing to performance: 

• The spatial scheme, based on a nodal discontinuous Galerkin method, enables high-order convergence 
rates. Since the weak element-to-element coupling and the dense algebraic operations required over 
each element, it is a suitable scheme for parallel computation on acceleration devices, especially with 
high-degree basis functions. 

• The levels-based multi-rate time-stepping scheme significantly improves run times for multi-scale cases 
with unstructured meshes, which are common in seismic imaging. For parallel computations, mesh 
partition with METIS preserves good load balancing between cores thanks to strategies based on 
weights and lumping of elements. 

• The programming approach with MPI and OCCA provides a portable implementation that can be run 
on state-of-the-art computing architectures, and can be optimized for each of them. While the results 
presented herein are obtained with kernels optimized for GPUs, tuning for alternative architectures 
requires to modify only a few parameters and, eventually, only a few code lines. 

• While RTM procedure generally requires massive data storage with slow I/O, the thin halo regions 
inherent in DG discretizations overcome this bottleneck. Low storage requirements for DG boundary 
data allows halo trace data to be stored in RAM rather than relying on HDD, even for larger test cases. 
In the worst cases, where available RAM is not large enough, data exchanges between RAM and HDD 
can be performed asynchronously with a limited, possibly insignificant, increase of run time. 

• Several optimization strategies for data storage and data movement improve run time: locality of stor¬ 
age, specific granularity of storage to enable cache reuse, and hidden/asynchronous memory transfers 
for device-host and MPI node-to-node communications. 

Future work includes the implementation of domain truncation methods to accurately deal with the artificial 
boundaries of the computational domain. Techniques such as perfectly matched layers and complete radiation 
boundary conditions improve the solution close to artificial boundaries, but it would be challenging to 
incorporate them in our computational strategy without penalizing too much both run time and scalability. 
We are also preparing a new implementation for hybrid meshes made of hexahedra, tetrahedra, prisms 
and pyramids. DG implementations are particularly efficient for hexahedral elements, but mesh generators 
can only generate meshes dominated by hexahedra with a small amount of other elements. In our RTM 
procedure, several compression strategies can be tested to reduce the size of boundary data. Finally, in the 
future, we will propose cases involving more complicated physics, with anisotropic media and elastic waves, 
and generalization of the new characteristics-based imaging condition. 
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A Derivation of the Nodal DG Scheme 


The semi-discrete scheme is obtained from the formulation ([n])-@ by replacing the fields by their approx¬ 
imations ([^-(10), and by using the Lagrange polynomial functions as test functions. For each node n of 
element k, we then have 
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with j = 1,2, 3. In this system, Nf is the number of faces by element, Nfp is the number of nodes by face, 
n' is the face node index, and the corresponding node index in the list of all element nodes is denoted with 
n{n'). We have defined the penalties evaluated at each face node as 
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where Ui^k.f is the Gartesian component of the outward unit normal of the face d£)kj ■ The star subscript 
denotes the nodal values on the side of the neighboring element. The components of the so-called local mass, 
local stiffness and local face mass matrices are respectively 


^k,mn 

1 ^k,m^k,n 


•tDfc 

Sk,i,mn — 

f f! d£k,n J 

/ h,m , d:s., 

JDfc dxi 

^k,f,7nn' ~ 

1 ^k,m^k,n{n') 


•’90k.f 


27 







Inverting the local mass matrix, the system can then be written as 
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where q^ and q^^ contain the discrete unknowns associated to each field for all nodes of element k. The 
vectors j and P'kj contain the penalties for all face nodes of face /. and Sk,i are Np x Np matrices, 
while Mfc j is a Np x Nfp matrix. 

Since an affine transformation connects each element to a reference element, the local matrices can be 
expressed in terms of reference matrices that are the same for each element, combined with scaling and linear 
combinations. Denoting the transformation from the reference cell I to each cell with :r€ I—D^, 
we have 
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where Jk = det(9’®'fe/9r) is the Jacobian of the transformation. Similarly, each face is related to a reference 
face with an ad hoc affine transformation. This permits to express all local face mass matrices in term of 
a reference two-dimensional mass matrix, involving the Jacobian of this transformation Jkj as well as a 
Np X Nfp matrix which connects the face node indices with the node indices in the list of all nodes. 

For each element, the system of equations can finally be written in the convenient form 
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where Dj are the differentiation matrices and Lj are the lifting matrices for the reference element. Defining 
the geometric factors 
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one obtains equations (13)-(14). 
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